Penguins dataset

Import libraries

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.0
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(palmerpenguins)
library(ggthemes)

Create plot

penguins_plot = ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", y = "Body mass (g)",
    color = "Species", shape = "Species"
  ) +
  scale_color_colorblind()
penguins_plot
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

2.2.5 Exercises

1. How many rows are in penguins? How many columns?

nrow(penguins) #rows
## [1] 344
ncol(penguins) #columns
## [1] 8

2. What does the bill_depth_mm variable in the penguins data frame describe?

?penguins

Answer: bill_depth_mm a number denoting bill depth (millimeters)

3. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis.

ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = bill_depth_mm, color = species)
) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

4. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

ggplot(
  data = penguins,
  mapping = aes(x = species, y = bill_depth_mm, color = species)
) +
  geom_col()
## Warning: Removed 2 rows containing missing values (`position_stack()`).

5. Why does the following give an error and how would you fix it?

#ggplot(data = penguins) + 
#  geom_point()

Answer: We need to add values for ‘x’ and ‘y’

Error in geom_point() : ℹ Error occurred in the 1st layer. Caused by error in compute_geom_1(): ! geom_point() requires the following missing aesthetics: x and y

6. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

Answer: The na.rm argument in geom_point() is used to remove missing values (NA) from the data before plotting. When na.rm = TRUE, any rows containing missing values in the x or y variables will be removed from the plot. If na.rm = FALSE (the default), any rows containing missing values will be plotted with missing values as points.

ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = body_mass_g)
) +
  geom_point(na.rm = TRUE)

7.Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

ggplot(
  data = penguins,
  mapping = aes(x = bill_length_mm, y = body_mass_g)
) +
  geom_point(na.rm = TRUE) +
  labs(caption = "Data come from the palmerpenguins package.")

8.Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g, color = bill_depth_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Body mass vs. flipper length colored by bill depth",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Bill depth (mm)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: The following aesthetics were dropped during statistical transformation: colour
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: Removed 2 rows containing missing values (`geom_point()`).

9. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Answer: Without method = “lm” curves are not smooth. The se = FALSE argument in geom_smooth() is used to turn off the display of confidence intervals around the smoothing line.

10. Will these two graphs look different? Why/why not?

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Removed 2 rows containing missing values (`geom_point()`).

Answer: These two graphs will produce the same plot.

2.4.3 Exercises

1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?

ggplot(penguins, aes(y = species)) +
  geom_bar()

Answer: In this plot, the aes mapping assigns the species names to the y axis of the plot.

2. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")

Answer: Definitely fill is more useful.

3. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

Answer: The bins argument in geom_histogram() specifies the number of bins to use for dividing the range of the data into intervals, which are then used to construct the histogram.

4. Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?

Option 1: binwidth = 2

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 2)

Option 2:binwidth = 1

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 1)

Option 3: binwidth = .1

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = .1)

Option 4: binwidth = .5

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = .2)

Answer: IMO option number 3 is the best and represents the best distribution of variables.

2.5.5 Exercises

1. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

?mpg

Categorical:

  • manufacturer

  • model

  • trans

  • drv

  • fl

  • class

Numerical:

  • displ

  • year

  • cyl

  • cty

  • hwy

To see this information when you run mpg in R, we can use the str function.

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

2. Make a scatterplot of hwy vs. displ using the mpg data frame.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Next, map a third, numerical variable to color,

ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

then size

ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point()

then both color and size,

ggplot(mpg, aes(x = displ, y = hwy, size = cyl, color = cyl)) +
  geom_point()

then shape

It’s not possible to assign numerical variable to shape parameter.

ggplot(mpg, aes(x = displ, y = hwy, size = cyl, color = cyl, shape = trans)) +
  geom_point()
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 10. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 96 rows containing missing values (`geom_point()`).

How do these aesthetics behave differently for categorical vs. numerical variables?**

Answer: When mapping aesthetics to categorical variables, each category is given a unique value for the aesthetic (e.g., a unique color or shape). This allows us to distinguish between the categories, but there is no inherent ordering or relationship between the categories.

When mapping aesthetics to numerical variables, the values of the aesthetic are based on the values of the numerical variable. For example, when we mapped the engine displacement (displ) to size, the larger the engine displacement, the larger the size of the point. This allows us to see patterns in the data based on the values of the numerical variable.

3. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

ggplot(mpg, aes(x = displ, y = hwy, linewidth = year)) +
  geom_point()

Answer: Nothing special (?)

4. What happens if you map the same variable to multiple aesthetics?

Answer: We will get observations arranged in a straight line.

5. Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(aes(color = species)) +
  facet_wrap(~species)
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Answer: Doesn’t seem like facet_wrap(~species) very useful, we’ll get 3 graphs for each species separately.

6. Why does the following yield two separate legends? How would you fix it to combine the two legends?

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species")
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Answer: We can remove “color =”

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species
  )
) +
  geom_point() +
  labs("Species")
## Warning: Removed 2 rows containing missing values (`geom_point()`).

7. Create the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

Answer: The first stacked bar plot shows the proportion of penguin species on each island. This plot can answer the question “What is the distribution of penguin species across the different islands?”

The second stacked bar plot shows the proportion of penguins on each island for each species. This plot can answer the question “How are the penguins distributed across the different islands for each species?”

2.6.1 Exercises

1. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.png")
## Saving 7 x 5 in image

Answer: The second plot is saved as mpg-plot.png, which shows a scatterplot of cty vs hwy. This is because ggsave saves the last plot that was created using ggplot2.

2. What do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?

To save the plot as a PDF instead of a PNG, you can change the file extension in the ggsave() function to .pdf.

ggsave("mpg-plot.pdf")
## Saving 7 x 5 in image

To find out what types of image files would work in ggsave(), you can use the ?ggsave command to bring up the documentation for the function.